{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "非常感谢您学习到这里，如果有时间的话希望能帮我填一个：[教程反馈](https://www.wjx.cn/jq/61058547.aspx)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "本教程中，你会学到结构化数据的存储及处理的思维方式。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 复杂的数据处理任务应该如何切入？"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 坚决不手动解决问题"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们做数据的人，有一点是我们的原则：\n",
    "\n",
    "<div class=\"alert alert-info\"><p>   除非有人把刀子架在你脖子上威胁，否则坚决不一条一条数据手动做任何匹配、复制、粘贴等机械操作\n",
    "    </p></div>\n",
    "   \n",
    "    \n",
    "遇到任何复杂的问题时，我们一定会想尽办法把问题变成可以批量解决的编程问题"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 数据处理任务应该如何分解？"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在遇到一个数据处理任务的时候，我们应该首先应该思考的最重要的问题是：\n",
    "\n",
    "    这个数据处理任务的输入与输出是什么？\n",
    "   \n",
    "具体来说，这个问题又可以细化为：我们有什么样的数据？数据是如何存储的？我们要得到什么的结果？结果的数据又是什么形式的？\n",
    "<img src=\"https://gitee.com/ni1o1/pygeo-tutorial/raw/master/resource/6/2.png\"  style=\"width:800px\">  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在搞清楚了输入与输出之后，下一步就是应该怎么把任务分为多个小任务：\n",
    "<img src=\"https://gitee.com/ni1o1/pygeo-tutorial/raw/master/resource/6/3.png\"  style=\"width:800px\">  \n",
    "每个小任务的输出，又是下一个任务的输入  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 以数据表方式存储数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在复杂的数据处理任务中，最常用的是以数据表的形式存储数据，也就是\n",
    "<a href=\"https://baike.baidu.com/item/%E7%BB%93%E6%9E%84%E5%8C%96%E6%95%B0%E6%8D%AE/5910594?fr=aladdin\">结构化数据</a>  \n",
    "以表形式存储的数据，在处理的过程中可以很容易地用sql或python的pandas等工具来处理，处理效率非常高  \n",
    "而且，数据表的好处是，每添加一列，就可以增加一个维度的信息。  \n",
    "换句话说，如果有多个长得差不多的表，就可以把表都合起来变成一个表，增加一个字段，这个字段存储的是这部分数据来自哪个表"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 举例分析"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 数据处理任务"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-09-08T12:13:18.391971Z",
     "start_time": "2019-09-08T12:13:18.387370Z"
    }
   },
   "source": [
    "举个例子：  \n",
    "假设现在有一批地铁出行数据，数据形式如下：\n",
    "<img src=\"https://gitee.com/ni1o1/pygeo-tutorial/raw/master/resource/6/1.jpg\"  style=\"width:800px\">  \n",
    "  \n",
    "  \n",
    "这里面，每一行数据代表的是一个乘客的一次出行，每个数字代表不同的地铁站，一次出行会经过一连串的地铁站  \n",
    "从这个地铁出行数据中，我们的数据处理任务是：\n",
    "\n",
    "    提取出每个地铁换乘站的换乘量\n",
    "    \n",
    "即，假设1号地铁站连接1号线，2号线，3号线三条地铁线路，我们想要得到这三条线之间通过站点1互相换乘的流量是多少。  \n",
    "如果我们要一次性得到全市的所有换乘站点的所有换乘量，应该如何批量运算？"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 思路"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 输入\n",
    "输入我们已经知道了，就是上面的数据形式  \n",
    "但是，输入的数据不是非常完美的结构化数据，因为它的每一行的长度都不同，一行里面包含多个轨迹点的信息。  \n",
    "我们期望得到的输入是，完美的结构化数据，每一行都是类似的结构：  \n",
    "\n",
    "<img src=\"https://gitee.com/ni1o1/pygeo-tutorial/raw/master/resource/6/5.png\"  style=\"width:300px\">  \n",
    "而数据表的好处就是，每添加一列，我们就可以增加一个维度的信息  \n",
    "每个人每次出行的轨迹，我们没必要存成单独的表，我们只需要在前面加一列，标记它是第几个出行，以此来区分不同次的出行"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 输出\n",
    "输出是什么？很简单，全市的所有换乘站点的所有方向的换乘量  \n",
    "那么，为了包含这个信息，输出的数据表应该如何设计？  \n",
    "  \n",
    "需要包含的信息是，在哪个站点，从哪个线，换乘到哪个线，符合前面条件的数量是多少人？  \n",
    "\n",
    "<img src=\"https://gitee.com/ni1o1/pygeo-tutorial/raw/master/resource/6/4.png\"  style=\"width:300px\">  \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 任务分解\n",
    "搞清楚输入和输出后，就可以进行任务分解了，我们要解决的就是以下四个任务:  \n",
    "\n",
    "    任务1：由最原始的输入，整理为完美的表数据，得到表1\n",
    "    任务2：由表1得到表2，按顺序排列，看每个出行是经过了哪些线路\n",
    "    任务3：由表2得到表3，由经过的线路得到换乘表，每个出行，在哪个站点，由哪个线换乘到哪个线\n",
    "    任务4：由表3得到表4，将表3集计，在哪个站点，由哪个线换乘到哪个线的人数有多少\n",
    "\n",
    "<img src=\"https://gitee.com/ni1o1/pygeo-tutorial/raw/master/resource/6/6.jpg\"  style=\"width:1000px\">  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 作业"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-09-08T15:37:52.761018Z",
     "start_time": "2019-09-08T15:37:52.757304Z"
    }
   },
   "source": [
    "<img src=\"https://gitee.com/ni1o1/pygeo-tutorial/raw/master/resource/6/7.jpg\"  style=\"width:1000px\">  \n",
    "试计算深圳全市所有公交线路的线路动态重复度"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-09-08T15:43:58.260726Z",
     "start_time": "2019-09-08T15:43:58.183013Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>linename</th>\n",
       "      <th>stationname</th>\n",
       "      <th>stationgeo</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>650路(锦程文丰公交场站-凤凰山脚)</td>\n",
       "      <td>['锦程文丰公交场站', '明士达公司', '鸿桥工业园西', '鸿桥工业园', '三洋部件...</td>\n",
       "      <td>['113.78947398104066,22.728276040706554', '113...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>650路(凤凰山脚-锦程文丰公交场站)</td>\n",
       "      <td>['凤凰山脚', '凤凰第二工业区', '凤凰台湾街', '凤凰社区', '凤凰广场', '...</td>\n",
       "      <td>['113.8559999718322,22.68862002580157', '113.8...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>m502a线(龙西公交总站-龙西公交总站)</td>\n",
       "      <td>['龙西公交总站', '添利工业园', '瓦窑坑', '五联社区', '崇和学校', '美信...</td>\n",
       "      <td>['114.25368997182035,22.759529012563924', '114...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>高快巴士39路(华富路②-坂田风门坳总站)</td>\n",
       "      <td>['华富路②', '宏杨学校', '坂田地铁站', '扬马市场', '金洲嘉丽园', '坂田...</td>\n",
       "      <td>['114.08788496287927,22.548925018156375', '114...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>高快巴士39路(坂田风门坳总站-华富路②)</td>\n",
       "      <td>['坂田风门坳总站', '岗头市场', '华为基地', '华为单身公寓北', '万科城', ...</td>\n",
       "      <td>['114.0748459685757,22.675367010065482', '114....</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                linename                                        stationname  \\\n",
       "0    650路(锦程文丰公交场站-凤凰山脚)  ['锦程文丰公交场站', '明士达公司', '鸿桥工业园西', '鸿桥工业园', '三洋部件...   \n",
       "1    650路(凤凰山脚-锦程文丰公交场站)  ['凤凰山脚', '凤凰第二工业区', '凤凰台湾街', '凤凰社区', '凤凰广场', '...   \n",
       "2  m502a线(龙西公交总站-龙西公交总站)  ['龙西公交总站', '添利工业园', '瓦窑坑', '五联社区', '崇和学校', '美信...   \n",
       "3  高快巴士39路(华富路②-坂田风门坳总站)  ['华富路②', '宏杨学校', '坂田地铁站', '扬马市场', '金洲嘉丽园', '坂田...   \n",
       "4  高快巴士39路(坂田风门坳总站-华富路②)  ['坂田风门坳总站', '岗头市场', '华为基地', '华为单身公寓北', '万科城', ...   \n",
       "\n",
       "                                          stationgeo  \n",
       "0  ['113.78947398104066,22.728276040706554', '113...  \n",
       "1  ['113.8559999718322,22.68862002580157', '113.8...  \n",
       "2  ['114.25368997182035,22.759529012563924', '114...  \n",
       "3  ['114.08788496287927,22.548925018156375', '114...  \n",
       "4  ['114.0748459685757,22.675367010065482', '114....  "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#读取数据\n",
    "import pandas as pd\n",
    "f = open(r'data-sample/busline.csv')\n",
    "busline = pd.read_csv(f)\n",
    "f.close()\n",
    "busline.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-09-08T15:44:54.038437Z",
     "start_time": "2019-09-08T15:44:54.033903Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2113"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(busline)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "204.109px"
   },
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
